Audio encoding device and method

ABSTRACT

A method and a device encode N audio signals from N microphones, where N≥3. For each pair of the N audio signals, an angle of incidence of direct sound is estimated. A-format direct sound signals are derived from the estimated angles of incidence by deriving an A-format direct sound signal from each estimated angle. Each A-format direct sound signal is a first-order virtual microphone signal, for example a cardioid signal.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Patent Application Number PCT/EP2018/056411, filed on Mar. 14, 2018, the disclosure of which is hereby referenced in its entirety.

FIELD

The present disclosure is related to audio recording and encoding, in particular for virtual reality applications, especially for virtual reality provided by a small portable device.

BACKGROUND

Virtual reality (VR) sound recording typically requires Ambisonic B-format with expensive directional microphones. Professional audio microphones exist that record either A-format signals, to be encoded into Ambisonic B-format, or Ambisonic B-format directly, for instance Soundfield microphones. More generally speaking, it is technically difficult to arrange omnidirectional microphones on a mobile device to capture sound for VR.

A way to generate Ambisonic B-format signals, given a distribution of omnidirectional microphones, is based on differential microphone arrays, i.e. applying delay-and-add beamforming in order to derive first-order virtual microphone (e.g. cardioid) signals as A-format.

The first limitation of this technique results from its spatial aliasing which, by design, reduces the bandwidth to frequencies f in the range:

$f < \frac{c}{4 d_{mic}}, \qquad (1)$

where c stands for the speed of sound and d_(mic) for the distance between a pair of two omnidirectional microphones. A second weakness results, for higher order Ambisonic B-format, from the microphone requirement: the required number of microphones and their required positions are no longer suitable for mobile devices.

Another way of generating ambisonic B-format signals from omnidirectional microphones corresponds to sampling the sound field at the recording point in space using a sufficiently dense distribution of microphones. These sampled sound pressure signals are then converted to spherical harmonics, and can be linearly combined to eventually generate B-format signals.

The main limitation of such approaches is the required number of microphones. For consumer applications, with only few microphones (commonly up to 6), linear processing is too limited, leading to signal to noise ratio (SNR) issues at low frequencies, and aliasing at high frequencies.

Directional Audio Coding (DirAc) is a further method for spatial sound representation, but it does not generate B-format signals. Instead, it reads first order B-format signals and generates a number of related audio parameters (direction of arrival, diffuseness) and adds these to an omnidirectional audio channel. Later, the decoder takes the above information and converts it to a multi-channel audio signal using amplitude panning for direct sound and de-correlating for diffuse sound.

DirAc is thus a different technique, which takes B-format as input to render it to its own audio format.

SUMMARY

Therefore, the present inventors have recognized a need to provide an audio encoding device and method, which allow for generating ambisonic B-format sound signals, while requiring only a low number of microphones, and achieving a high output sound quality.

Embodiments of the present disclosure provide such audio encoding devices and methods that allow for generating ambisonic B-format sound signals, while requiring only a low number of microphones, and achieving a high output sound quality.

According to a first aspect of the present disclosure, an audio encoding device, for encoding N audio signals, from N microphones, where N≥3, is provided. The device comprises a delay estimator, configured to estimate angles of incidence of direct sound by estimating for each pair of the N audio signals an angle of incidence of direct sound, and a beam deriver, configured to derive A-format direct sound signals from the estimated angles of incidence by deriving from each estimated angle of incidence an A-format direct sound signal, each A-format direct sound signal being a first-order virtual microphone signal, especially a cardioid signal. This allows for determining the A-format direct sound signals with a low hardware effort.

According to an implementation form of the first aspect, the device additionally comprises an encoder, configured to encode the A-format direct sound signals in first-order ambisonic B-format direct sound signals by applying a transformation matrix to the A-format direct sound signals. This allows for generating ambisonic B-format signals using only a very low number of microphones, but still achieving a high output sound quality.

According to an implementation form of the first aspect, N=3. The audio encoding device moreover comprises a short time Fourier transformer, configured to perform a short time Fourier transformation on each of the N audio signals x₁, x₂, x₃, resulting in N short time Fourier transformed audio signals X₁[k,i], X₂[k,i], X₃[k,i]. The delay estimator is then configured to

determine cross spectra of each pair of short time Fourier transformed audio signals according to:

$X_{12}[k,i] = \alpha_X X_1[k,i] X_2^*[k,i] + (1-\alpha_X) X_{12}[k-1,i],$
$X_{13}[k,i] = \alpha_X X_1[k,i] X_3^*[k,i] + (1-\alpha_X) X_{13}[k-1,i],$
$X_{23}[k,i] = \alpha_X X_2[k,i] X_3^*[k,i] + (1-\alpha_X) X_{23}[k-1,i],$

determine an angle of the complex cross spectrum of each pair of short time Fourier transformed audio signals, i.e. the arctangent of the ratio between its imaginary part and its real part, according to:

$\tilde{\psi}_{12}[k,i] = \arctan\left( j\,\frac{X_{12}^*[k,i] - X_{12}[k,i]}{X_{12}[k,i] + X_{12}^*[k,i]} \right),$
$\tilde{\psi}_{13}[k,i] = \arctan\left( j\,\frac{X_{13}^*[k,i] - X_{13}[k,i]}{X_{13}[k,i] + X_{13}^*[k,i]} \right),$
$\tilde{\psi}_{23}[k,i] = \arctan\left( j\,\frac{X_{23}^*[k,i] - X_{23}[k,i]}{X_{23}[k,i] + X_{23}^*[k,i]} \right),$

perform a phase unwrapping of $\tilde{\psi}_{12}$, $\tilde{\psi}_{13}$, $\tilde{\psi}_{23}$, resulting in Ψ₁₂, Ψ₁₃, Ψ₂₃,

estimate the delay in number of samples according to:

$\delta_{12}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\psi_{12}[k,i],$
$\delta_{13}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\psi_{13}[k,i],$
$\delta_{23}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\psi_{23}[k,i],$ if $i \leq i_{alias}$,

or

$\delta_{12}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\Psi_{12}[k,i],$
$\delta_{13}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\Psi_{13}[k,i],$
$\delta_{23}[k,i] = \frac{N_{STFT}/2+1}{i\pi}\,\Psi_{23}[k,i],$ if $i > i_{alias}$,

estimate the delay in seconds according to:

$\tau_{12}[k,i] = \frac{\delta_{12}[k,i]}{f_s},$
$\tau_{13}[k,i] = \frac{\delta_{13}[k,i]}{f_s},$
$\tau_{23}[k,i] = \frac{\delta_{23}[k,i]}{f_s},$

estimate the angles of incidence according to:

$\theta_{12}[k,i] = \arcsin\left( \frac{c\,\tau_{12}[k,i]}{d_{mic}} \right),$
$\theta_{13}[k,i] = \arcsin\left( \frac{c\,\tau_{13}[k,i]}{d_{mic}} \right),$
$\theta_{23}[k,i] = \arcsin\left( \frac{c\,\tau_{23}[k,i]}{d_{mic}} \right),$

wherein
x₁ is a first audio signal of the N audio signals,
x₂ is a second audio signal of the N audio signals,
x₃ is a third audio signal of the N audio signals,
X₁ is a first short time Fourier transformed audio signal,
X₂ is a second short time Fourier transformed audio signal,
X₃ is a third short time Fourier transformed audio signal,
k is a frame of the short time Fourier transformed audio signal,
i is a frequency bin of the short time Fourier transformed audio signal,
X₁₂ is a cross spectrum of a pair of X₁ and X₂,
X₁₃ is a cross spectrum of a pair of X₁ and X₃,
X₂₃ is a cross spectrum of a pair of X₂ and X₃,
α_(X) is a forgetting factor,
X* is the conjugate complex of X,
j is the imaginary unit,
$\tilde{\psi}_{12}$ is an angle of the complex cross spectrum of X₁₂,
$\tilde{\psi}_{13}$ is an angle of the complex cross spectrum of X₁₃,
$\tilde{\psi}_{23}$ is an angle of the complex cross spectrum of X₂₃,
i_(alias) is a frequency bin corresponding to an aliasing frequency,
f_(s) is a sampling frequency,
d_(mic) is a distance of the microphones, and
c is the speed of sound.

This allows for a simple and efficient determining of the delays.

According to a further implementation form of the first aspect, the beam deriver is configured to determine cardioid directional responses according to:

$D_{12}[k,i] = \frac{1}{2}\left( 1 + \cos\left( \theta_{12}[k,i] - \frac{\pi}{2} \right) \right),$
$D_{13}[k,i] = \frac{1}{2}\left( 1 + \cos\left( \theta_{13}[k,i] - \frac{\pi}{2} \right) \right),$
$D_{23}[k,i] = \frac{1}{2}\left( 1 + \cos\left( \theta_{23}[k,i] - \frac{\pi}{2} \right) \right),$

and derive the A-format direct sound signals according to:

$A_{12}[k,i] = D_{12}[k,i]\,X_1[k,i],$
$A_{13}[k,i] = D_{13}[k,i]\,X_1[k,i],$
$A_{23}[k,i] = D_{23}[k,i]\,X_1[k,i],$

wherein
D is a cardioid directional response, and
A is an A-format direct sound signal.

This allows for a simple and efficient determining of the beam signals.

According to a further implementation form of the first aspect, the encoder is configured to encode the A-format direct sound signals to the first-order ambisonic B-format direct sound signals according to:

$\begin{bmatrix} R_W \\ R_X \\ R_Y \end{bmatrix} = \Gamma^{-1} \begin{bmatrix} A_{12} \\ A_{13} \\ A_{23} \end{bmatrix},$

wherein
R_(W) is a first, zero-order ambisonic B-format direct sound signal,
R_(X) is a first, first-order ambisonic B-format direct sound signal,
R_(Y) is a second, first-order ambisonic B-format direct sound signal, and
Γ⁻¹ is the transformation matrix.

This allows for a simple and efficient determining of the beam signals.

According to a further implementation form of the first aspect, the device comprises a direction of arrival estimator, configured to estimate a direction of arrival from the first-order ambisonic B-format direct sound signals, and a higher order ambisonic encoder, configured to encode higher order ambisonic B-format direct sound signals, using the first-order ambisonic B-format direct sound signals and the estimated direction of arrival, wherein higher order ambisonic B-format direct sound signals have an order higher than one. Thereby, an efficient encoding of the ambisonic B-format direct sound signal is achieved.

According to a further implementation form of the first aspect, the direction of arrival estimator is configured to estimate the direction of arrival according to:

$\theta_{XY}[k,i] = \arctan\frac{R_Y[k,i]}{R_X[k,i]},$

wherein θ_(XY)[k,i] is a direction of arrival of a direct sound of frame k and frequency bin i. This allows for a simple and efficient determining of the directions of arrival.

According to a further implementation form of the first aspect, the higher order ambisonic B-format direct sound signals comprise second order ambisonic B-format direct sound signals limited to two dimensions, wherein the higher order ambisonic encoder is configured to encode the second order ambisonic B-format direct sound signals according to:

$R_R \triangleq (3\sin^2\phi - 1)/2 = -1/2,$
$R_S \triangleq \sqrt{3}/2\,\cos\theta\,\sin 2\phi = 0,$
$R_T \triangleq \sqrt{3}/2\,\sin\theta\,\sin 2\phi = 0,$
$R_U \triangleq \sqrt{3}/2\,\cos 2\theta\,\cos^2\phi = \sqrt{3}/2\,\cos 2\theta_{XY},$
$R_V \triangleq \sqrt{3}/2\,\sin 2\theta\,\cos^2\phi = \sqrt{3}/2\,\sin 2\theta_{XY},$

wherein
R_(R) is a first, second-order ambisonic B-format direct sound signal,
R_(S) is a second, second-order ambisonic B-format direct sound signal,
R_(T) is a third, second-order ambisonic B-format direct sound signal,
R_(U) is a fourth, second-order ambisonic B-format direct sound signal,
R_(V) is a fifth, second-order ambisonic B-format direct sound signal,
≜ denotes "defined as",
ϕ is an elevation angle, and
θ is an azimuth angle.

This allows for an efficient encoding of the higher order ambisonic B-format signals.
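The two-dimensional second-order formulas above can be illustrated with a minimal sketch. It assumes NumPy, takes the direction of arrival θ_(XY) as already estimated, and sets the elevation ϕ to zero as in the formulas; the function name `second_order_gains_2d` is hypothetical and not part of the disclosure.

```python
import numpy as np

def second_order_gains_2d(theta_xy):
    """Second-order B-format gains for horizontal-only sound (elevation phi = 0).

    theta_xy: direction of arrival in radians, scalar or array (hypothetical input).
    Returns a dict with the gains R, S, T, U, V from the formulas above.
    """
    theta_xy = np.asarray(theta_xy, dtype=float)
    half_sqrt3 = np.sqrt(3.0) / 2.0
    return {
        "R": np.full_like(theta_xy, -0.5),          # (3 sin^2(0) - 1) / 2
        "S": np.zeros_like(theta_xy),               # sin(2 * 0) = 0
        "T": np.zeros_like(theta_xy),               # sin(2 * 0) = 0
        "U": half_sqrt3 * np.cos(2.0 * theta_xy),   # cos^2(0) = 1
        "V": half_sqrt3 * np.sin(2.0 * theta_xy),
    }
```

In this sketch the gains depend only on θ_(XY); how they are combined with the first-order signals is left to the higher order ambisonic encoder described in the text.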

According to a further implementation form of the first aspect, the audio encoding device comprises a microphone matcher, configured to perform a matching of the N frequency domain audio signals, resulting in N matched frequency domain audio signals. This allows for further quality increase of the output signals.

According to a further implementation form of the first aspect, the audio encoding device comprises a diffuse sound estimator, configured to estimate a diffuse sound power, and a de-correlation filter bank, configured to perform a de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power. This allows for implementing diffuse sound into the output signals.

According to a further implementation form of the first aspect, the diffuse sound estimator is configured to estimate the diffuse sound power according to:

$A = 1 - \Phi_{diff}^2,$
$B = 2\Phi_{diff}\,E\{X_1 X_2^*\} - E\{X_1 X_1^*\} - E\{X_2 X_2^*\},$
$C = E\{X_1 X_1^*\}\,E\{X_2 X_2^*\} - E\{X_1 X_2^*\}^2,$
$P_{diff}[k,i] = \frac{-B - \sqrt{B^2 - 4AC}}{2A},$

wherein
P_(diff) is the diffuse sound power,
E{ } is an expectation value,
Φ_(diff) is a normalized cross-correlation coefficient between N₁ and N₂,
N₁ is diffuse sound in a first channel, and
N₂ is diffuse sound in a second channel.

This allows for an especially efficient estimation of the diffuse sound power.
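The quadratic above can be sketched as follows. This is a minimal illustration assuming NumPy, real-valued expectation estimates (for example obtained by time averaging), and |Φ_(diff)| < 1; clipping the discriminant at zero is an added safeguard, and the function name `diffuse_power` is hypothetical.

```python
import numpy as np

def diffuse_power(e11, e22, e12, phi_diff):
    """Solve A*P^2 + B*P + C = 0 for the diffuse sound power P.

    e11, e22: estimates of E{X1 X1*} and E{X2 X2*} (real, non-negative).
    e12:      estimate of E{X1 X2*}, assumed real-valued here (assumption).
    phi_diff: normalized cross-correlation coefficient of the diffuse sound.
    """
    a = 1.0 - phi_diff ** 2
    b = 2.0 * phi_diff * e12 - e11 - e22
    c = e11 * e22 - e12 ** 2
    # Assumption: clip small negative discriminants caused by estimation noise.
    disc = np.maximum(b ** 2 - 4.0 * a * c, 0.0)
    return (-b - np.sqrt(disc)) / (2.0 * a)
```

As a plausibility check, with fully coherent direct sound of power S and diffuse power P in both channels, E{X₁X₁*} = E{X₂X₂*} = S + P and E{X₁X₂*} = S + Φ_(diff)P, and the smaller root of the quadratic recovers P.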

According to a further implementation form of the first aspect, the de-correlation filter bank is configured to perform the de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power:

$\tilde{D}_W[k,i] = DFR_W\, w_u\, U_1\, P_{2D\text{-}diff}[k,i],$
$\tilde{D}_X[k,i] = DFR_X\, w_u\, U_2\, P_{2D\text{-}diff}[k,i],$
$\tilde{D}_Y[k,i] = DFR_Y\, w_u\, U_3\, P_{2D\text{-}diff}[k,i],$

wherein

$DFR_a \triangleq \frac{1}{4\pi} \int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \int_{-\pi}^{\pi} |R_a(\theta,\phi)|^2 \cos\phi \; d\theta \, d\phi,$
$R_X(\theta,\phi) = \cos\phi\,\cos\theta,$
$R_Y(\theta,\phi) = \cos\phi\,\sin\theta,$
$R_W(\theta,\phi) = 1,$
$w_u[n] = \exp\left( -\frac{0.5\,\ln(10^6)\,|n|}{f_s\,RT_{60}} \right)$ with $-l_u < n < l_u,$

wherein
$\tilde{D}_W[k,i]$ is a first channel diffuse sound component,
$\tilde{D}_X[k,i]$ is a second channel diffuse sound component,
$\tilde{D}_Y[k,i]$ is a third channel diffuse sound component,
DFR_(W) is a diffuse-field response of the first channel,
DFR_(X) is a diffuse-field response of the second channel,
DFR_(Y) is a diffuse-field response of the third channel,
w_(u) is an exponential window,
RT₆₀ is a reverberation time,
U₁, U₂, U₃ is the de-correlation filter bank,
u is a Gaussian noise sequence,
l_(u) is a given length of the Gaussian noise sequence, and
P_(2D-diff) is the diffuse noise power.

Thereby, an efficient de-correlation of the diffuse sound power is calculated.
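The exponential window w_(u) and the Gaussian noise sequences can be sketched as below. This is only an illustration under stated assumptions: NumPy is used, the noise sequences are drawn from a seeded generator, and the function names are hypothetical. How the resulting sequences are applied per time-frequency tile is not shown.

```python
import numpy as np

def exponential_window(l_u, fs, rt60):
    """Exponential window w_u[n] for -l_u < n < l_u.

    The amplitude decays by 60 dB (factor 1e-3) after fs * rt60 samples.
    Returns the index vector n and the window values.
    """
    n = np.arange(-l_u + 1, l_u)
    w = np.exp(-0.5 * np.log(1e6) * np.abs(n) / (fs * rt60))
    return n, w

def decorrelation_sequences(l_u, fs, rt60, num_filters=3, seed=0):
    """Windowed Gaussian noise sequences, one per output channel (W, X, Y).

    Assumption: independent noise draws are used to obtain mutually
    de-correlated sequences; the seed only makes the sketch reproducible.
    """
    rng = np.random.default_rng(seed)
    _, w = exponential_window(l_u, fs, rt60)
    u = rng.standard_normal((num_filters, w.size))
    return w * u  # window broadcast over the filter axis
```

For example, `decorrelation_sequences(101, 1000.0, 0.1)` returns an array of shape (3, 201), one sequence per B-format channel.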

According to a further implementation form of the first aspect, the audio encoding device comprises an adder, configured to add, channel-wise, the first-order ambisonic B-format direct sound signals and the higher order ambisonic B-format direct sound signals, and/or the diffuse sound signals, resulting in complete ambisonic B-format signals. Thereby, in a simple manner, a finished output signal is generated.

According to a second aspect of the present disclosure, an audio recording device comprising N microphones configured to record the N audio signals and an audio encoding device according to the first aspect or any of the implementation forms of the first aspect is provided. This allows for an audio recording and encoding in a single device.

According to a third aspect of the present disclosure, a method for encoding N audio signals, from N microphones, where N≥3, is provided. The method comprises estimating angles of incidence of direct sound by estimating for each pair of the N audio signals an angle of incidence of direct sound, and deriving A-format direct sound signals from the estimated angles of incidence by deriving from each estimated angle of incidence an A-format direct sound signal, each A-format direct sound signal being a first-order virtual microphone signal. This allows for determining the A-format direct sound signals with a low hardware effort.

According to an implementation form of the third aspect, the method additionally comprises encoding the ambisonic A-format direct sound signals in first-order ambisonic B-format direct sound signals by applying at least one transformation matrix to the A-format direct sound signals. This allows for a simple and efficient determining of the ambisonic B-format direct sound signals.

The method may further comprise extracting higher order ambisonic B-format direct sound signals by extracting direction of arrival from first order ambisonic B-format direct sound signals.

According to a fourth aspect of the present disclosure, a computer program with a program code for performing the method according to the third aspect is provided.

A method is provided for parametric encoding of multiple omnidirectional microphone signals into any order Ambisonic B-format by means of:

- robust estimation of the angle of incidence of sound, based on microphone pair beam signals, and
- de-correlation of diffuse sound.

The disclosed approach is based on at least three omnidirectional microphones on a mobile device. Successively, it estimates the angles of incidence of direct sound by means of delay estimation between the different microphone pairs. Given the incidences of direct sound, it derives beam signals, called the direct sound A-format signals. The direct sound A-format signals are then encoded into first order B-format using a relevant transformation matrix.

For optional higher order B-format, a direction of arrival estimate is derived from the X and Y first order B-format signals. The diffuse, non-directive sound is optionally rendered as multiple orthogonal components, generated using de-correlation filters.

Generally, it has to be noted that all arrangements, devices, elements, units and means and so forth described in the present application could be implemented by software or hardware elements or any kind of combination thereof. Furthermore, the devices may be processors or may comprise processors, wherein the functions of the elements, units and means described in the present application may be implemented in one or more processors. All steps which are performed by the various entities described in the present application as well as the functionality described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if in the following description or exemplary embodiments, a specific functionality or step to be performed by a general entity is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respect of software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The present disclosure is in the following explained in detail in relation to embodiments of the present disclosure in reference to the enclosed drawings, in which:

FIG. 1 shows a first embodiment of the audio encoding device according to the first aspect of the present disclosure and the audio recording device according to the second aspect of the present disclosure;

FIG. 2 shows a second embodiment of the audio encoding device according to the first aspect of the present disclosure and the audio recording device according to the second aspect of the present disclosure;

FIG. 3 shows a pair of microphones in a diagram depicting the determining of an angle of incidence of a sound event;

FIG. 4 shows a third embodiment of the audio recording device according to the second aspect of the present disclosure;

FIG. 5 shows A-format direct sound signals in a two-dimensional diagram;

FIG. 6 shows B-format direct sound signals in a two-dimensional diagram;

FIG. 7 shows diffuse sound received by two microphones;

FIG. 8 shows direct sound and diffuse sound in a two-dimensional diagram;

FIG. 9 shows an example of a de-correlation filter, as used by an audio encoding device according to a fourth embodiment of the first aspect; and

FIG. 10 shows an embodiment of the third aspect of the present disclosure in a flow diagram.

DETAILED DESCRIPTION

First, we demonstrate the construction and general function of an embodiment of the first aspect and second aspect of the present disclosure along FIG. 1. With regard to FIG. 2 to FIG. 9, further details of the construction and function of the first embodiment and the second embodiment are shown. With regard to FIG. 10, finally the function of an embodiment of the third aspect of the present disclosure is described in detail.

In FIG. 1, a first embodiment of the audio encoding device 3 is shown. Moreover, a first embodiment of the audio recording device 1 according to the second aspect of the present disclosure is shown.

The audio recording device 1 comprises a number of N≥3 microphones 2, which are connected to the audio encoding device 3. The audio encoding device 3 comprises a delay estimator 11, which is connected to the microphones 2. The audio encoding device 3 moreover comprises a beam deriver 12, which is connected to the delay estimator 11. Furthermore, the audio encoding device 3 comprises an encoder 13, which is connected to the beam deriver 12. Note that the encoder 13 is an optional feature with regard to the first aspect of the present disclosure.

In order to determine ambisonic B-format direct sound signals, the microphones 2 record N≥3 audio signals. In this diagram, these audio signals are preprocessed by components integrated into the microphones 2. For example, a transformation into the frequency domain is performed. This will be shown in more detail along FIG. 2. The preprocessed audio signals are handed to the delay estimator 11, which estimates angles of incidence of direct sound by estimating for each pair of the N audio signals an angle of incidence of direct sound. These angles of incidence of direct sound are handed to the beam deriver 12, which derives A-format direct sound signals therefrom. Each A-format direct sound signal is a first-order virtual microphone signal, especially a cardioid signal. These signals are handed on to the encoder 13, which encodes the A-format direct sound signals to first-order ambisonic B-format direct sound signals by applying a transformation matrix to the A-format direct sound signals. The encoder outputs the first-order ambisonic B-format direct sound signals.

In FIG. 2, a second embodiment of the audio encoding device 3 and the audio recording device 1 are shown. Here, the individual microphones 2a, 2b, 2c, which correspond to the microphones 2 of FIG. 1, are shown. Each of the microphones 2a, 2b, 2c is connected to a short-time Fourier transformer 10a, 10b, 10c, which each performs a short-time Fourier transformation of the N audio signals resulting in N short-time Fourier transformed audio signals. These are handed on to the delay estimator 11, which performs the delay estimation and hands the angles of incidence to the beam deriver 12. The beam deriver 12 determines the A-format direct sound signals and hands them to the encoder 13, which performs the encoding to B-format direct sound signals. In FIG. 2, further components of the audio encoding device 3 are shown. Here, the audio encoding device 3 moreover comprises a direction-of-arrival estimator 20, which is connected to the encoder 13. Moreover, it comprises a higher order ambisonic encoder 21, which is connected to the direction-of-arrival estimator 20.

The direction-of-arrival estimator 20 estimates a direction of arrival from the first-order ambisonic B-format direct sound signals and hands it to the higher order ambisonic encoder 21. The higher order ambisonic encoder 21 encodes higher order ambisonic B-format direct sound signals, using the first-order ambisonic B-format direct sound signals and the estimated direction of arrival as an input. The higher order ambisonic B-format direct sound signals have an order higher than 1.

Moreover, the audio encoding device 3 comprises a microphone matcher 30, which performs a matching of the N frequency domain audio signals output by the short-time Fourier transformers 10a, 10b, 10c, resulting in N matched frequency domain audio signals. Connected to the microphone matcher 30, the audio encoding device 3 moreover comprises a diffuse sound estimator 31, which is configured to estimate a diffuse sound power based upon the N matched frequency domain audio signals. Furthermore, the audio encoding device 3 comprises a de-correlation filter bank 32, which is connected to the diffuse sound estimator 31 and configured to perform a de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power.

Finally, the audio encoding device 3 comprises an adder 40, which adds the first-order B-format direct sound signals provided by the encoder 13, the higher order ambisonic B-format signals provided by the higher order encoder 21 and the diffuse sound components provided by the de-correlation filter bank 32. The sum signal is handed to an inverse short-time Fourier transformer 41, which performs an inverse short-time Fourier transformation to achieve the final ambisonic B-format signals in the time domain.

In the following, along FIG. 3 to FIG. 9, further details regarding the function of the individual components shown in FIG. 2 are described.

In FIG. 3, an angle of incidence, as it is determined by the delay estimator 11, is shown.

Especially, the propagation of direct sound following a ray from a sound source to a pair of microphones in the free-field is considered in FIG. 3.

In FIG. 4, an example of an audio recording device 1 is shown in a two-dimensional diagram. The three microphones 2a, 2b, 2c are depicted in their actual physical location.

The following algorithm aims at estimating the angle of incidence of direct sound based on cross-correlation between both recorded microphone signals x₁ and x₂, and derives parametrically gain filters to generate beams focusing in specific directions.

A phase estimation between both recording microphones is carried out at each time-frequency tile. The time-frequency representations X₁ and X₂ of the microphone signals are obtained using an N_(STFT)-point short-time Fourier transform (STFT). The delay relation between the two microphones can be derived from the cross-spectrum:

$X_{12}[k,i] = \alpha_X X_1[k,i] X_2^*[k,i] + (1-\alpha_X) X_{12}[k-1,i], \qquad (2)$

where * denotes the complex conjugate operator, and α_(X) is determined by:

$\alpha_X = \frac{N_{STFT}}{T_X f_s}, \qquad (3)$

where T_(X) is a time constant in seconds and f_(s) is the sampling frequency. The phase response is defined as the angle of the complex cross-spectrum X₁₂, derived from the ratio between its imaginary part and its real part:

$\tilde{\psi}_{12}[k,i] = \arctan\left( j\,\frac{X_{12}^*[k,i] - X_{12}[k,i]}{X_{12}[k,i] + X_{12}^*[k,i]} \right), \qquad (4)$

where j is the imaginary unit, which satisfies j²=−1.
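The recursion (2), the forgetting factor (3) and the phase (4) can be sketched as follows. This is a minimal illustration assuming NumPy and STFT arrays of shape (frames, bins); the zero initial state and the use of the four-quadrant `np.angle` in place of a plain arctangent are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

def forgetting_factor(n_stft, t_x, fs):
    """Forgetting factor alpha_X = N_STFT / (T_X * f_s), as in (3)."""
    return n_stft / (t_x * fs)

def smoothed_cross_spectrum(X1, X2, alpha):
    """Recursive cross-spectrum X12[k, i] of (2) for complex STFTs X1, X2.

    X1, X2: complex arrays of shape (frames, bins).
    Assumption: the recursion starts from X12[-1, i] = 0.
    """
    X12 = np.zeros(X1.shape, dtype=complex)
    prev = np.zeros(X1.shape[1], dtype=complex)
    for k in range(X1.shape[0]):
        prev = alpha * X1[k] * np.conj(X2[k]) + (1.0 - alpha) * prev
        X12[k] = prev
    return X12

def wrapped_phase(X12):
    """Angle of the complex cross-spectrum, wrapped to (-pi, pi]."""
    return np.angle(X12)
```

With a constant phase offset between the two channels, the wrapped phase of the smoothed cross-spectrum equals that offset in every frame, while its magnitude builds up over frames at a rate set by α_(X).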

Unfortunately, analogous to the Nyquist frequency in temporal sampling, a microphone array has a restriction on the minimum spatial sampling rate. Using two microphones, the smallest wavelength of interest is given by:

$\lambda_{alias} = 2 d_{mic}, \qquad (5)$

corresponding to a maximum frequency,

$f_{alias} = \frac{c}{\lambda_{alias}}, \qquad (6)$

up to which the phase estimation is unambiguous. Above this frequency, the measured phase is still obtained following (4), but with an uncertainty term equal to an integer multiple l of 2π:

$\tilde{\psi}_{12}[k,i] = \psi_{12}[k,i] + 2\pi \cdot l[i]. \qquad (7)$

Because the maximum travelling time between the two microphones of the array is given by d_(mic)/c, the integer l is bounded by:

$l[i] \leq L[i] = \frac{i\,d_{mic}\,f_s}{c\left( \frac{N_{STFT}}{2} + 1 \right)}. \qquad (8)$
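As a small numeric illustration of (5), (6) and (8), the following sketch evaluates the aliasing frequency and the bound L[i]. The default speed of sound of 343 m/s and the function names are assumptions made here for illustration only.

```python
def aliasing_frequency(d_mic, c=343.0):
    """f_alias = c / lambda_alias with lambda_alias = 2 * d_mic, per (5) and (6)."""
    return c / (2.0 * d_mic)

def wrap_bound(i, d_mic, fs, n_stft, c=343.0):
    """Upper bound L[i] on the number of 2*pi wraps at frequency bin i, per (8)."""
    return i * d_mic * fs / (c * (n_stft / 2 + 1))
```

For a microphone spacing of 10 cm, `aliasing_frequency(0.1)` gives about 1715 Hz, which shows why a phase extension above the aliasing frequency matters on a device of that size.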

A high frequency extension is provided based on equation (8) to constrain an unwrapping algorithm. The unwrapping aims at correcting the phase angle $\tilde{\psi}_{12}[k,i]$ by adding a multiple l[k,i] of 2π when the absolute jump between two consecutive elements, $|\tilde{\psi}_{12}[k,i] - \tilde{\psi}_{12}[k,i-1]|$, is greater than or equal to the jump tolerance of π. The estimated unwrapped phase ψ₁₂ is obtained by limiting the multiples l to their physically possible values. Eventually, even if the phase is aliased at high frequency, its slope still follows the same principles as the delay estimation at low frequency. For the purpose of delay estimation, it is then sufficient to integrate the unwrapped phase ψ₁₂ over a number of frequency bins in order to derive its slope for the later delay estimation:

$\Psi_{12}[k,i] = \frac{1}{2N_{hf}} \sum_{j=-N_{hf}}^{N_{hf}} \psi_{12}[k,i+j], \qquad (9)$

where N_(hf) stands for the frequency bandwidth over which the phase is integrated, and j is here a summation index rather than the imaginary unit.
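A simplified sketch of the unwrapping and of the integration (9) for one frame is given below. It assumes NumPy and uses the generic `np.unwrap` (jump tolerance π along the frequency axis) as a stand-in; it does not enforce the physical bound L[i] of (8), so it is not the constrained algorithm described above. The kernel follows the 1/(2N_(hf)) normalization exactly as written in (9), and the function name is hypothetical.

```python
import numpy as np

def unwrap_and_integrate(psi_wrapped, n_hf):
    """Unwrap the phase of one frame along frequency, then apply (9).

    psi_wrapped: 1-D array of wrapped phases, one value per frequency bin.
    n_hf:        half-width of the integration band in bins.
    Note: bins closer than n_hf to either edge see zero padding, so their
    integrated values are not meaningful in this sketch.
    """
    psi = np.unwrap(psi_wrapped)                        # removes 2*pi jumps
    kernel = np.ones(2 * n_hf + 1) / (2.0 * n_hf)       # 1/(2 N_hf) as in (9)
    big_psi = np.convolve(psi, kernel, mode="same")     # sum over i-N_hf .. i+N_hf
    return psi, big_psi
```

A linear phase, as produced by a pure delay, is recovered by the unwrapping step as long as the phase increment between neighbouring bins stays below π.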

For each frequency bin i, dividing by the corresponding physical frequency, the delay δ₁₂[k,i], expressed in number of samples, is obtained from the previously derived phase:

$\delta_{12}[k,i] = \begin{cases} \frac{N_{STFT}/2+1}{i\pi}\,\psi_{12}[k,i], & \text{if } i \leq i_{alias}, \\ \frac{N_{STFT}/2+1}{i\pi}\,\Psi_{12}[k,i], & \text{otherwise}, \end{cases} \qquad (10)$

where i_(alias) is the frequency bin corresponding to the aliasing frequency (1). The delay in seconds is:

$\tau_{12}[k,i] = \frac{\delta_{12}[k,i]}{f_s}. \qquad (11)$

The derived delay relates directly to the angle of incidence of sound emitted by a sound source, as illustrated in FIG. 3. Given the travelling time delay between both microphones, the resulting angle of incidence θ₁₂[k,i] is:

$\theta_{12}[k,i] = \arcsin\left( \frac{c\,\tau_{12}[k,i]}{d_{mic}} \right), \qquad (12)$

with d_(mic) the distance between both microphones and c the speed of sound in air.
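The chain from phase to delay in samples (10), delay in seconds (11) and angle of incidence (12) can be sketched as follows. The sketch assumes NumPy, a frequency bin index i > 0 and a default speed of sound of 343 m/s; clipping the arcsine argument to [−1, 1] is an added safeguard against estimation noise, and the function name is hypothetical.

```python
import numpy as np

def phase_to_angle(phase, i, n_stft, fs, d_mic, c=343.0):
    """Convert a phase at bin i into delay (samples, seconds) and angle.

    phase: psi_12 (i <= i_alias) or Psi_12 (otherwise), in radians.
    i:     frequency bin index, must be greater than zero.
    """
    delta = (n_stft / 2 + 1) / (i * np.pi) * phase   # delay in samples, (10)
    tau = delta / fs                                  # delay in seconds, (11)
    # Assumption: clip so that noisy estimates stay in the arcsin domain.
    ratio = np.clip(c * tau / d_mic, -1.0, 1.0)
    theta = np.arcsin(ratio)                          # angle of incidence, (12)
    return delta, tau, theta
```

A delay equal to half of the maximum travelling time d_(mic)/c maps to an angle of incidence of π/6, i.e. 30 degrees.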

In free-field, for direct sound, the directional response of a cardioid microphone pointing to the side of the array is built as a function of the estimated angle of incidence:

$D[k,i] = \frac{1}{2}\left( 1 + \cos\left( \theta_{12}[k,i] - \frac{\pi}{2} \right) \right). \qquad (13)$

By applying the gain D to the input spectrum X₁, a virtual cardioid signal can be retrieved from the direct sound of the input microphone signals. This corresponds to the function of the beam deriver 12.
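The cardioid gain (13) and its application to the input spectrum can be sketched as below, assuming NumPy; the function names `cardioid_gain` and `virtual_cardioid` are hypothetical.

```python
import numpy as np

def cardioid_gain(theta):
    """Cardioid directional response D of (13) for angles theta in radians."""
    theta = np.asarray(theta, dtype=float)
    return 0.5 * (1.0 + np.cos(theta - np.pi / 2.0))

def virtual_cardioid(X1, theta):
    """Apply the gain D to the spectrum X1, tile by tile."""
    return cardioid_gain(theta) * X1
```

The gain is 1 for θ = π/2 (sound arriving from the side the cardioid points to), 0.5 for θ = 0 and 0 for θ = −π/2.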

In FIG. 5, three cardioid signals based upon three microphone pairs are depicted in a two-dimensional diagram, showing the respective gains.

In FIG. 6, the gains of B-format ambisonic direct sound signals are shown in a two-dimensional diagram.

In the following, the conversion from A-format direct sound signals to B-format direct sound signals is shown. This corresponds to the function of the encoder 13.

The following Table lists the Ambisonic B-format channels and their spherical representation D(θ,ϕ) up to third order, normalized with the Schmidt semi-normalization (SN3D), where θ and ϕ are, respectively, the azimuth and elevation angles:

Order | Channel | SN3D definition D(θ, ϕ)
0 | W | 1
1 | X | cos θ cos ϕ
1 | Y | sin θ cos ϕ
1 | Z | sin ϕ
2 | R | (3 sin² ϕ − 1)/2
2 | S | (√3/2) cos θ sin 2ϕ
2 | T | (√3/2) sin θ sin 2ϕ
2 | U | (√3/2) cos 2θ cos² ϕ
2 | V | (√3/2) sin 2θ cos² ϕ
3 | K | sin ϕ (5 sin² ϕ − 3)/2
3 | L | √(3/8) cos θ cos ϕ (5 sin² ϕ − 1)
3 | M | √(3/8) sin θ cos ϕ (5 sin² ϕ − 1)
3 | N | (√15/2) cos 2θ sin ϕ cos² ϕ
3 | O | (√15/2) sin 2θ sin ϕ cos² ϕ
3 | P | √(5/8) cos 3θ cos³ ϕ
3 | Q | √(5/8) sin 3θ cos³ ϕ

These spherical harmonics form a set of orthogonal basis functions and can be used to describe any function on the surface of a sphere.

Without loss of generality, three microphones, the minimum number, are considered and placed in the horizontal XY-plane, for instance disposed at the edges of a mobile device as illustrated in FIG. 3, having the coordinates (x_(m₁), y_(m₁)), (x_(m₂), y_(m₂)), and (x_(m₃), y_(m₃)).

The three possible unordered microphone pairs are defined as:
pair 1 ≜ mic2→mic1
pair 2 ≜ mic3→mic2
pair 3 ≜ mic1→mic3

The look direction (θ=0) being defined by the X-axis, their direction vectors are:

$\begin{matrix}{{v_{p_{1}} = {\begin{pmatrix}x_{m_{1}} \\y_{m_{1}}\end{pmatrix} - \begin{pmatrix}x_{m_{2}} \\y_{m_{2}}\end{pmatrix}}},\quad{v_{p_{2}} = {\begin{pmatrix}x_{m_{2}} \\y_{m_{2}}\end{pmatrix} - \begin{pmatrix}x_{m_{3}} \\y_{m_{3}}\end{pmatrix}}},\quad\text{and}\quad{v_{p_{3}} = {\begin{pmatrix}x_{m_{3}} \\y_{m_{3}}\end{pmatrix} - \begin{pmatrix}x_{m_{1}} \\y_{m_{1}}\end{pmatrix}}}.} & (14)\end{matrix}$

The directions of the pairs in the horizontal plane are:

$\begin{matrix}{{\forall{n \in \lbrack {1.{.3}} \rbrack}},{\theta_{p_{n}} = {{\arctan( \frac{y_{v_{p_{n}}}}{x_{v_{p_{n}}}} )}.}}} & (15)\end{matrix}$

The microphone spacing of each pair is:

$\begin{matrix}{{\forall{n \in \lbrack {1.{.3}} \rbrack}},{\partial_{p_{n}}{= {\sqrt{x_{v_{p_{n}}}^{2} + y_{v_{p_{n}}}^{2}}.}}}} & (16)\end{matrix}$
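By way of example, the pair geometry of equations (14) to (16) may be computed as sketched below. The microphone coordinates are hypothetical, and the use of the four-quadrant `np.arctan2` in place of the plain arctangent of (15) is an assumption of this sketch, made so that the direction of each pair is resolved over the full circle.

```python
import numpy as np

# Hypothetical microphone positions (metres) in the XY-plane: mic1, mic2, mic3
mics = np.array([[0.00, 0.00],
                 [0.15, 0.00],
                 [0.00, 0.07]])

# Index pairs (a, b) such that v = m_a - m_b, following eq. (14)
PAIRS = [(0, 1), (1, 2), (2, 0)]

def pair_geometry(mics):
    """Return direction vectors, pair directions and pair spacings."""
    v = np.array([mics[a] - mics[b] for a, b in PAIRS])  # eq. (14)
    theta_p = np.arctan2(v[:, 1], v[:, 0])               # eq. (15), four-quadrant
    d_p = np.hypot(v[:, 0], v[:, 1])                     # eq. (16)
    return v, theta_p, d_p

v, theta_p, d_p = pair_geometry(mics)
```

Because the three vectors run around a closed triangle, they sum to zero, which gives a simple sanity check on the pair definitions.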

The gain (13) resulting from the angle of incidence estimation is applied to each pair, leading to cardioid directional responses:
∀n∈[1 . . . 3], A_(p_n)[k,i] = D_(p_n)[k,i] X₁[k,i].  (17)

The three resulting cardioids are pointing in the three directions θ_(p₁), θ_(p₂), and θ_(p₃), defining the corresponding A-format representation, as illustrated in FIG. 4.

Assuming that the obtained cardioids are coincident, the corresponding first order Ambisonic B-format signals can be computed by means of a linear combination of the spectra A_(p_n). The conversion from Ambisonic B-format to A-format is implemented as:

$\begin{matrix}{\begin{bmatrix}A_{p_{1}} \\A_{p_{2}} \\A_{p_{3}}\end{bmatrix} = {{\Gamma\begin{bmatrix}R_{W} \\R_{X} \\R_{Y}\end{bmatrix}} = {{\frac{1}{2}\begin{bmatrix}1 & {\cos\mspace{14mu}\theta_{p_{1}}} & {\sin\mspace{14mu}\theta_{p_{1}}} \\1 & {\cos\mspace{14mu}\theta_{p_{2}}} & {\sin\mspace{14mu}\theta_{p_{2}}} \\1 & {\cos\mspace{14mu}\theta_{p_{3}}} & {\sin\mspace{14mu}\theta_{p_{3}}}\end{bmatrix}}\begin{bmatrix}R_{W} \\R_{X} \\R_{Y}\end{bmatrix}}}} & (18)\end{matrix}$

The inverse of the matrix Γ in (18) converts the cardioids to Ambisonic B-format:

$\begin{matrix}{\begin{bmatrix}R_{W} \\R_{X} \\R_{Y}\end{bmatrix} = {\Gamma^{- 1}\begin{bmatrix}A_{p_{1}} \\A_{p_{2}} \\A_{p_{3}}\end{bmatrix}}} & (19)\end{matrix}$
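The conversion of equations (18) and (19) may be illustrated by the following sketch. The pair directions and the B-format values are hypothetical test data, and solving the linear system with `np.linalg.solve` rather than forming Γ⁻¹ explicitly is a numerical choice of this sketch, not a requirement of the disclosure.

```python
import numpy as np

def gamma_matrix(theta_p):
    """B-format to A-format matrix of eq. (18) for the pair directions theta_p."""
    theta_p = np.asarray(theta_p, dtype=float)
    return 0.5 * np.column_stack(
        [np.ones_like(theta_p), np.cos(theta_p), np.sin(theta_p)]
    )

def a_to_b(a_format, theta_p):
    """A-format to B-format conversion of eq. (19), i.e. Gamma^-1 applied to A."""
    return np.linalg.solve(gamma_matrix(theta_p), a_format)

# Hypothetical pair directions and B-format values (R_W, R_X, R_Y) for one bin
theta_p = [0.0, 2 * np.pi / 3, -2 * np.pi / 3]
b_true = np.array([1.0, 0.5, -0.25])
a_format = gamma_matrix(theta_p) @ b_true   # eq. (18)
b_rec = a_to_b(a_format, theta_p)           # eq. (19)
```

The matrix Γ is invertible only when the three pair directions are distinct, which holds for microphones that are not collinear.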

The first order Ambisonic B-format normalized directional responses R_(W), R_(X), and R_(Y) are shown in FIG. 5, where R_(W) corresponds to a monopole, while the signals R_(X) and R_(Y) correspond to two orthogonal dipoles.

In the following, the determining of higher order ambisonic B-format signals is shown. This corresponds to the function of the direction-of-arrival estimator 20 and the higher order ambisonic encoder 21.

In the previous derivation of the first order ambisonic B-format signals R_(W), R_(X), and R_(Y) for the direct sound, no explicit direction of arrival (DOA) of sound was computed. Instead, the directional responses of the three signals R_(W), R_(X), and R_(Y) have been obtained from the A-format cardioid signals A_(p_n) in (17).

In order to obtain the higher order (e.g. second and third) ambisonic B-format signals, an explicit DOA is derived based on the two first order ambisonic B-format signals R_(X) and R_(Y) as:

$\begin{matrix}{{\theta_{XY}\lbrack {k,i} \rbrack} = {\arctan{\frac{R_{Y}\lbrack {k,i} \rbrack}{R_{X}\lbrack {k,i} \rbrack}.}}} & (20)\end{matrix}$

Again, assuming three omnidirectional microphones in the horizontal plane (ϕ=0), the channels of interest as defined in the ambisonic definition in the Table are limited to:

- order 0: W
- order 1: X, Y
- order 2: R, U, V
- order 3: L, M, P, Q

The other channels are null since they are modulated by sin ϕ, with ϕ=0. For each of the above listed channels the directional responses are thus derived by substituting the azimuth angle θ by the estimated DOA θ_(XY). For instance, considering second order (assuming no elevation, i.e. ϕ=0):

$\begin{matrix}{\begin{matrix}{R_{R}\overset{\Delta}{=}{{( {{3\sin^{2}\phi} - 1} )/2} = {-1/2}}} \\{R_{S}\overset{\Delta}{=}{{\sqrt{3}/2\,\cos\theta\,\sin 2\phi} = 0}} \\{R_{T}\overset{\Delta}{=}{{\sqrt{3}/2\,\sin\theta\,\sin 2\phi} = 0}} \\{R_{U}\overset{\Delta}{=}{{\sqrt{3}/2\,\cos 2\theta\,\cos^{2}\phi} = {\sqrt{3}/2\,\cos 2\theta_{XY}}}} \\{R_{V}\overset{\Delta}{=}{{\sqrt{3}/2\,\sin 2\theta\,\cos^{2}\phi} = {\sqrt{3}/2\,\sin 2\theta_{XY}}}}\end{matrix}} & (21)\end{matrix}$

The resulting ambisonic channels, R_(R), R_(U), R_(V), R_(L), R_(M), R_(P), and R_(Q), contain only the direct sound components of the sound field.
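As an illustration of equations (20) and (21), the second order directional responses in the horizontal plane may be sketched as follows. The function name `second_order_gains` is hypothetical, the inputs are assumed to be real-valued first order responses, and `np.arctan2` is used as a four-quadrant variant of the arctangent in (20); these are assumptions of the sketch.

```python
import numpy as np

def second_order_gains(r_x, r_y):
    """Estimate the DOA from R_X and R_Y and derive R_R, R_U, R_V for phi = 0."""
    r_x = np.asarray(r_x, dtype=float)
    r_y = np.asarray(r_y, dtype=float)
    theta_xy = np.arctan2(r_y, r_x)                  # eq. (20), four-quadrant
    r_r = -0.5 * np.ones_like(theta_xy)              # (3 sin^2(0) - 1) / 2
    r_u = np.sqrt(3) / 2 * np.cos(2 * theta_xy)      # eq. (21)
    r_v = np.sqrt(3) / 2 * np.sin(2 * theta_xy)      # eq. (21)
    return theta_xy, r_r, r_u, r_v

# Hypothetical single source at an azimuth of 60 degrees
azimuth = np.pi / 3
theta_xy, r_r, r_u, r_v = second_order_gains([np.cos(azimuth)], [np.sin(azimuth)])
```

The channels R_(S) and R_(T) are omitted from the sketch because they vanish for ϕ=0.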

Now, the handling of diffuse sound is shown. This corresponds to the diffuse sound estimator 31 and the de-correlation filter bank 32 of FIG. 2.

In FIG. 7, the occurrence of direct sound from a sound source and omnidirectional diffuse sound is shown in a diagram depicting the locations of two microphones.

In FIG. 8, the directional responses to a sound source of direct sound are shown. Additionally, omnidirectional diffuse sound is depicted.

The previous derivation of the ambisonic B-format signals is only valid under the assumption of direct sound. It does not hold for diffuse sound. In the following, a method for obtaining an equivalent diffuse sound for Ambisonic B-format signals is given. A sufficient time after the direct sound and a number of early reflections, numerous reflections are themselves reflected in the space, creating a diffuse sound field. A diffuse sound field is understood mathematically as independent sounds having the same energy and coming from all directions, as illustrated in FIG. 7.

It is assumed that X₁ and X₂ can be modelled as:
X₁[k,i] = S[k,i] + N₁[k,i],
X₂[k,i] = a[k,i] S[k,i] + N₂[k,i],  (22)
where a[k,i] is a gain factor, S[k,i] is the direct sound in the left channel, and N₁[k,i] and N₂[k,i] represent diffuse sound. From (22) it follows that:
E{X₁X₁*} = E{SS*} + E{N₁N₁*},
E{X₂X₂*} = a² E{SS*} + E{N₂N₂*},
E{X₁X₂*} = a E{SS*} + E{N₁N₂*}.  (23)

It is reasonable to assume that the amount of diffuse sound in both microphone signals is the same, i.e. E{N₁N₁*}=E{N₂N₂*}=E{NN*}. Furthermore, the normalized cross-correlation coefficient between N₁ and N₂ is denoted Φ_(diff) and can be obtained from Cook's formula:

$\begin{matrix}{{\Phi_{diff}[i]} = {\frac{\sin D}{D}\quad\text{with}\quad D = {\frac{2\pi\, i\, f_{s}\, d_{mic}}{c\, N_{STFT}}.}}} & (24)\end{matrix}$
Finally, (23) can be rewritten as:
E{X₁X₁*} = E{SS*} + E{NN*},
E{X₂X₂*} = a² E{SS*} + E{NN*},
E{X₁X₂*} = a E{SS*} + Φ_(diff) E{NN*}.  (25)

Elimination of E{SS*} and a in (25) yields the quadratic equation:
A E{NN*}² + B E{NN*} + C = 0  (26)
with
A = 1 − Φ_(diff)²,
B = 2Φ_(diff) E{X₁X₂*} − E{X₁X₁*} − E{X₂X₂*},
C = E{X₁X₁*} E{X₂X₂*} − E{X₁X₂*}².  (27)

The power estimate of diffuse sound, denoted P_(diff), is then the physically possible one of the two solutions of (26). The other solution, which yields a diffuse sound power larger than the microphone signal power, is discarded as physically impossible, i.e.:

$\begin{matrix}{{P_{diff}\lbrack {k,i} \rbrack} = {{E\{ {NN}^{*} \}} = {\frac{{- B} - \sqrt{B^{2} - {4{AC}}}}{2A}.}}} & (28)\end{matrix}$

Note that the contribution of the direct sound can then be computed straightforwardly as:
P_(dir)[k,i] = P_(X₁)[k,i] − P_(diff)[k,i].  (29)

This corresponds to the function of the diffuse sound estimator 31.
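For illustration, the diffuse sound power estimation of equations (24) to (29) may be sketched as follows. The function names, the numeric values and the clamping of the discriminant at zero are assumptions of this sketch, and the cross-spectrum expectation is assumed to be real-valued as in the model (25).

```python
import numpy as np

def diffuse_coherence(i, fs, d_mic, n_stft, c=343.0):
    """Diffuse-field coherence of eq. (24) for frequency bin i."""
    d = 2 * np.pi * i * fs * d_mic / (c * n_stft)
    return np.sinc(d / np.pi)   # np.sinc(x) = sin(pi x) / (pi x), so this is sin(d) / d

def diffuse_power(e11, e22, e12, phi_diff):
    """Solve the quadratic (26) with coefficients (27) and keep the root of eq. (28)."""
    a = 1.0 - phi_diff**2
    b = 2.0 * phi_diff * e12 - e11 - e22
    c = e11 * e22 - e12**2
    # Clamping is an added safeguard (not in the text) against rounding errors
    disc = np.maximum(b**2 - 4.0 * a * c, 0.0)
    return (-b - np.sqrt(disc)) / (2.0 * a)

# Hypothetical statistics built from model (25): E{SS*}=1, a=0.8, E{NN*}=0.5, Phi_diff=0.3
phi_diff = 0.3
e11 = 1.0 + 0.5
e22 = 0.8**2 * 1.0 + 0.5
e12 = 0.8 * 1.0 + phi_diff * 0.5
p_diff = diffuse_power(e11, e22, e12, phi_diff)
p_dir = e11 - p_diff            # eq. (29)
```

Note that the coefficient A vanishes when Φ_(diff)=1, which occurs at bin i=0, so that bin would need separate handling in a practical implementation.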

By definition, the Ambisonic B-format signals are obtained by projecting the sound field onto the spherical harmonics basis defined in the previous table. Mathematically, the projection corresponds to the integration of the sound field signal over the spherical harmonics.

As illustrated in FIG. 7, due to the orthogonality property of the spherical harmonics basis, projecting mathematically independent sounds from all directions onto this basis will result in three orthogonal components:
D_(W) ⊥ D_(X) ⊥ D_(Y).  (30)

Note that this property no longer holds for direct sound, since a sound source emitting from only one direction, projected onto the same basis, will result in a single gain equal to the directional responses at the incidence angle of the sound source, leading to non-orthogonal, or in other terms, correlated components R_(W), R_(X), and R_(Y).

However, here, considering a distribution of three omnidirectional microphones, the single diffuse sound estimate (28) is equivalent for all three microphones (or all three microphone pairs). Therefore, there is no possibility to retrieve the native diffuse sound components of the Ambisonic B-format signals, i.e. D_(W), D_(X), and D_(Y), as they would be obtained separately by projection of the diffuse sound field onto the spherical harmonics basis.

Instead of getting the exact diffuse sound Ambisonic B-format signals, an alternative is to generate three orthogonal diffuse sound components from the single known diffuse sound estimate P_(diff). This way, even if the diffuse sound components do not correspond to the native Ambisonic B-format obtained by projection, the most perceptually important property of orthogonality (enabling localization and spatialization) is preserved. This can be achieved by using de-correlation filters.

The de-correlation filters are derived from a Gaussian noise sequence u of given length l_(u). A Gram-Schmidt process applied to this sequence leads to N_(u) orthogonal sequences U₁, U₂, …, U_(N_u), which serve as filters to generate N_(u) orthogonal diffuse sounds. In the three-microphone case described previously, N_(u)=3.

Given the length l_(u) of the Gaussian noise sequence u, the de-correlation filters are shaped such that they have an exponential decay over time, similar to reverberation in a room. To do so, the sequences U₁, U₂, …, U_(N_u) are multiplied with an exponential window w_(u) with a time constant corresponding to the reverberation time RT₆₀:

$\begin{matrix}{{w_{u}[n]} = {\exp\left( -\frac{0.5\,\ln(10^{6})\, n}{f_{s}\,{RT}_{60}} \right)\quad\text{with}\quad -l_{u} < n < {l_{u}.}}} & (31)\end{matrix}$
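One possible way of generating such filters is sketched below, purely as an illustration. The text does not specify how N_(u) sequences are obtained from the single sequence u; this sketch assumes N_(u) independent Gaussian noise sequences, orthogonalizes them with a QR decomposition (which yields the same subspace as a Gram-Schmidt process) and evaluates the window of (31) for 0 ≤ n < l_(u) only. All names and values are hypothetical.

```python
import numpy as np

def decorrelation_filters(n_u, l_u, fs, rt60, seed=0):
    """Return orthonormal noise sequences, the window w_u and the windowed filters."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((l_u, n_u))   # assumed: one Gaussian sequence per column
    q, _ = np.linalg.qr(noise)                # columns are orthonormal, shape (l_u, n_u)
    n = np.arange(l_u)
    w_u = np.exp(-0.5 * np.log(1e6) * n / (fs * rt60))   # eq. (31) for n >= 0
    return q, w_u, q * w_u[:, None]

# Hypothetical parameters: three filters, 1024 taps, 1 kHz sampling, RT60 of 0.5 s
q, w_u, filters = decorrelation_filters(n_u=3, l_u=1024, fs=1000.0, rt60=0.5)
```

In this sketch the sequences are exactly orthogonal before windowing; after multiplication with w_(u) they are in general only approximately orthogonal.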

In FIG. 9, the filter response of a filter of the de-correlation filter bank 32 of FIG. 2 is shown. In particular, the time constant of such a filter is depicted.

The exponential decay of the de-correlation filters, illustrated in FIG. 9, will directly have an influence on the diffuse sound components in the B-format signals. A long decay will overemphasize the diffuse sound contribution in the final B-format but will ensure better separation between the three diffuse sound components.

Finally, the resulting de-correlation filters are modulated by the diffuse-field responses of the ambisonic B-format channels they correspond to. This way, the amount of diffuse sound in each ambisonic B-format channel matches the amount of diffuse sound of a natural B-format recording. The diffuse-field response DFR is the average of the corresponding spherical harmonic directional-response-squared contributions considering all directions, i.e.:

$\begin{matrix}{{DFR} = {\frac{1}{4\pi}{\int\limits_{- \frac{\pi}{2}}^{\frac{\pi}{2}}{\int\limits_{- \pi}^{\pi}{{{D( {\theta,\phi} )}}^{2}\mspace{14mu}\cos\;\phi\mspace{14mu} d\;\theta\mspace{14mu} d\;{\phi.}}}}}} & (32)\end{matrix}$
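As a numerical illustration of equation (32), the diffuse-field responses of the W, X and Y channels may be approximated with a midpoint rule as sketched below. The function name `diffuse_field_response` and the grid resolution are assumptions of this sketch.

```python
import numpy as np

def diffuse_field_response(directivity, n_theta=720, n_phi=360):
    """Approximate eq. (32) on a midpoint grid over azimuth theta and elevation phi."""
    d_theta = 2 * np.pi / n_theta
    d_phi = np.pi / n_phi
    theta = -np.pi + (np.arange(n_theta) + 0.5) * d_theta
    phi = -np.pi / 2 + (np.arange(n_phi) + 0.5) * d_phi
    th, ph = np.meshgrid(theta, phi, indexing="ij")
    integrand = np.abs(directivity(th, ph)) ** 2 * np.cos(ph)
    return integrand.sum() * d_theta * d_phi / (4 * np.pi)

# First order SN3D channels taken from the Table
dfr_w = diffuse_field_response(lambda th, ph: np.ones_like(th))
dfr_x = diffuse_field_response(lambda th, ph: np.cos(th) * np.cos(ph))
dfr_y = diffuse_field_response(lambda th, ph: np.sin(th) * np.cos(ph))
```

Evaluating the integral analytically gives 1 for the W channel and 1/3 for the X and Y channels, which the numerical approximation reproduces closely.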

In the three-microphone case (N_(u)=3), the resulting de-correlation filters are:
{tilde over (D)}_(W)[k,i] = DFR_(W) w_(u) U₁ P_(2D-diff)[k,i],
{tilde over (D)}_(X)[k,i] = DFR_(X) w_(u) U₂ P_(2D-diff)[k,i],
{tilde over (D)}_(Y)[k,i] = DFR_(Y) w_(u) U₃ P_(2D-diff)[k,i].  (33)

In this way, since the orthogonality between all three diffuse sounds is preserved, any further processing using the generated B-format, e.g. conventional ambisonic decoding, will also work on diffuse sound.

Finally, both direct and diffuse sound contributions have to be mixed together in order to generate the full Ambisonic B-format. Given the assumed signal model, the direct and diffuse sounds are, by definition, orthogonal, too. Thus, the complete Ambisonic B-format signals are obtained using a straightforward addition:
B_(W)[k,i] = R_(W)[k,i] + {tilde over (D)}_(W)[k,i],
B_(X)[k,i] = R_(X)[k,i] + {tilde over (D)}_(X)[k,i],
B_(Y)[k,i] = R_(Y)[k,i] + {tilde over (D)}_(Y)[k,i].  (34)
This addition is performed by the adder 40 of FIG. 2.

After this addition, only the inverse short-time Fourier transformation by the inverse short-time Fourier transformer 41 is performed in order to obtain the output B-format ambisonic signals.

Finally, in FIG. 10, an embodiment of the audio encoding method according to the third aspect of the present disclosure is shown. In a first optional step 100, at least 3 audio signals are recorded. In a second step 101, angles of incidence of direct sound are estimated, by estimating for each pair of the N audio signals an angle of incidence of direct sound. In a third step 102, A-format direct sound signals are derived from the estimated angles of incidence, by deriving from each estimated angle of incidence an A-format direct sound signal, each A-format direct sound signal being a first-order virtual microphone signal. In a fourth step 103, the ambisonic A-format direct sound signals are encoded to first-order ambisonic B-format direct sound signals by applying at least one transformation matrix to the A-format direct sound signals. Note that the fourth step of performing the encoding is an optional step with regard to the third aspect of the present disclosure. In a further optional fifth step 104, a higher order ambisonic B-format signal is generated based on a direction of arrival derived from the first order B-format.

Note that the audio encoding device according to the first aspect of the present disclosure as well as the audio recording device according to the second aspect of the present disclosure relate very closely to the audio encoding method according to the third aspect of the present disclosure. Therefore, the elaborations along FIGS. 1-9 are also valid with regard to the audio encoding method shown in FIG. 10.

These encoded signals are fully compatible with conventional Ambisonic B-format signals, and thus can be used as input for Ambisonic B-format decoding or any other processing. The same principle can be applied to retrieve full higher order Ambisonic B-format signals with both direct and diffuse sound contributions.

Abbreviations and Notations

Abbreviation | Definition
VR | Virtual Reality
DirAC | Directional Audio Coding
DOA | Direction Of Arrival
STFT | Short-Time Fourier Transform
SN3D | Schmidt semi-Normalization 3D
DFR | Diffuse-Field Response
SNR | Signal to Noise Ratio
HOA | High Order Ambisonic

Notation | Definition
x₁, x₂ | Both recorded microphone signals
X₁[k, i] | STFT of x₁ in frame k and frequency bin i
S[k, i] | STFT of source signal
N₁[k, i] | Diffuse noise in microphone 1
α_(X) | Forgetting factor
T_(X) | Averaging time-constant
X₁₂[k, i] | Cross-spectrum of microphone signals 1 and 2
f_(s) | Sampling frequency
f_(alias) | Aliasing frequency
d_(mic) | Distance between both microphones
E{ } | Expectation operator
θ and ϕ | Azimuth and elevation angles
P_(diff) | Power estimate of diffuse noise
R_(W), R_(X), R_(Y) | First order Ambisonic components
R_(R), R_(U), R_(V), R_(L), R_(M), R_(P), and R_(Q) | Higher order Ambisonic components
P_(2D-diff) | Power estimate of diffuse noise in 2D
U₁, U₂, …, U_(N_u) | Orthogonal sequences
{tilde over (ψ)}₁₂ | Angle of the complex cross-spectrum X₁₂
Ψ₁₂ | The mean of unwrapped phase ψ₁₂ over frequency aliasing
l[i] | An uncertainty integer which depends on frequency i
L[i] | Upper bound function for l[i] which depends on frequency i
D(θ, ϕ) | Spherical representation of the Ambisonic channels
A_(p₁), A_(p₂), A_(p₃), . . . , A_(p_n) | The cardioids, each generated with a pair of microphones
RT₆₀ | Reverberation time
l_(u) | Length of Gaussian noise sequence u
w_(u) | Exponential window
DFR_(W), DFR_(X), DFR_(Y) | Diffuse-Field Responses for W, X, Y components

The present disclosure is not limited to the examples and especially not to a specific number of microphones. The characteristics of the exemplary embodiments can be used in any advantageous combination.

The present disclosure has been described in conjunction with various embodiments herein. However, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in usually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the internet or other wired or wireless communication systems.

What is claimed is:
 1. An audio encoding device, for encoding N audiosignals, from N microphones where N≥3, the audio encoding devicecomprising: a delay estimator configured to estimate angles of incidenceof direct sound by estimating, for each pair of the N audio signals, anangle of incidence of the direct sound, and a beam deriver configured toderive A-format direct sound signals from the estimated angles ofincidence by deriving, from each of the estimated angles of incidence, arespective one of the A-format direct sound signals, each of theA-format direct sound signals being a first-order virtual microphonesignal; and an encoder configured to encode the A-format direct soundsignals in first-order ambisonic B-format direct sound signals byapplying a transformation matrix to the A-format direct sound signals,wherein N=3, wherein the audio encoding device comprises a short timeFourier transformer configured to perform a short time Fouriertransformation on each of the N audio signals x₁, x₂, x₃, resulting in Nshort time Fourier transformed audio signals X₁[k,i], X₂[k,i], X₃[k,i],wherein the delay estimator is configured to: determine cross spectra ofeach pair of the short time Fourier transformed audio signals accordingto:X ₁₂[k,i]=α_(X) X ₁[k,i]X* ₂[k,i]+(1−α_(X))X ₁₂[k−1,i],X ₁₃[k,i]=α_(X) X ₁[k,i]X* ₃[k,i]+(1−α_(X))X ₁₃[k−1,i], andX ₂₃[k,i]=α_(X) X ₂[k,i]X* ₃[k,i]+(1−α_(X))X ₂₃[k−1,i], determine anangle of the complex cross spectrum of each pair of the short timeFourier transformed audio signals according to:${{{\overset{\sim}{\psi}}_{12}\lbrack {k,i} \rbrack} = {\arctan\mspace{14mu} j\frac{{X_{12}\lbrack {k,i} \rbrack}{X_{12}^{*}\lbrack {k,i} \rbrack}}{{X_{12}\lbrack {k,i} \rbrack} + {X_{12}^{*}\lbrack {k,i} \rbrack}}}},{{{\overset{\sim}{\psi}}_{13}\lbrack {k,i} \rbrack} = {\arctan\mspace{14mu} j\frac{{X_{13}\lbrack {k,i} \rbrack}{X_{13}^{*}\lbrack {k,i} \rbrack}}{{X_{13}\lbrack {k,i} \rbrack} + {X_{13}^{*}\lbrack {k,i} \rbrack}}}},{and}$${{{\overset{\sim}{\psi}}_{23}\lbrack {k,i} 
\rbrack} = {\arctan\mspace{14mu} j\frac{{X_{23}\lbrack {k,i} \rbrack}{X_{23}^{*}\lbrack {k,i} \rbrack}}{{X_{23}\lbrack {k,i} \rbrack} + {X_{23}^{*}\lbrack {k,i} \rbrack}}}},$perform a phase unwrapping to {tilde over (ψ)} ₁₂ , {tilde over (ψ)} ₁₃, {tilde over (ψ)} ₂₃ , resulting in ψ₁₂ , ψ₁₃ , ψ₂₃ , estimate thedelay in number of samples according to:δ₁₂[k,i]=(N _(STFT)/2+1)/(iπ)ψ₁₂[k,i],δ₁₃[k,i]=(N _(STFT)/2+1)/(iπ)ψ₁₃[k,i], andδ₂₃[k,i]=(N _(STFT)/2+1)/(iπ)ψ₂₃[k,i], if i≤i _(alias) orδ₁₂[k,i]=(N _(STFT)/2+1)/(iπ)Ψ₁₂[k,i],δ₁₃[k,i]=(N _(STFT)/2+1)/(iπ)Ψ₁₃[k,i], andδ₂₃[k,i]=(N _(STFT)/2+1)/(iπ)Ψ₂₃[k,i], if i>i _(alias) estimate thedelay in seconds according to:${{\tau_{12}\lbrack {k,i} \rbrack} = \frac{\delta_{12}\lbrack {k,i} \rbrack}{f_{s}}},{{\tau_{13}\lbrack {k,i} \rbrack} = \frac{\delta_{13}\lbrack {k,i} \rbrack}{f_{s}}},{and}$${{\tau_{23}\lbrack {k,i} \rbrack} = \frac{\delta_{23}\lbrack {k,i} \rbrack}{f_{s}}},$and estimate the angles of incidence according to:${{\theta_{12}\lbrack {k,i} \rbrack} = {\arcsin( \frac{c\mspace{14mu}{\tau_{12}\lbrack {k,i} \rbrack}}{d_{mic}} )}},{{\theta_{13}\lbrack {k,i} \rbrack} = {\arcsin( \frac{c\mspace{14mu}{\tau_{13}\lbrack {k,i} \rbrack}}{d_{mic}} )}},{and}$${{\theta_{23}\lbrack {k,i} \rbrack} = {\arcsin( \frac{c\mspace{14mu}{\tau_{23}\lbrack {k,i} \rbrack}}{d_{mic}} )}},$and wherein: x₁ is a first audio signal of the N audio signals, x₂ is asecond audio signal of the N audio signals, x₃ is a third audio signalof the N audio signals, X₁ is a first short time Fourier transformedaudio signal of the short time Fourier transformed audio signals, X₂ isa second short time Fourier transformed audio signal of the short timeFourier transformed audio signals, X₃ is a third short time Fouriertransformed audio signal of the short time Fourer transformed audiosignals, k is a frame of the short time Fourier transformed audiosignals, and i is a frequency bin of the short time Fourier transformedaudio signals, X₁₂ is a cross spectrum of a pair 
of X₁ and X₂, X₁₃ is across spectrum of a pair of X₁ and X₃, X₂₃ is a cross spectrum of a pairof X₂ and X₃, α_(X) is a forgetting factor, X* is a conjugate complex ofX, j is the imaginary unit, {tilde over (ψ)} ₁₂ is an angle of thecomplex cross spectrum of X₁₂, {tilde over (ψ)} ₁₃ is an angle of thecomplex cross spectrum of X₁₃, {tilde over (ψ)} ₂₃ is an angle of thecomplex cross spectrum of X₂₃, i_(alias) is a frequency bincorresponding to an aliasing frequency, f_(s) is a sampling frequency,d_(mic) is a distance of the microphones, and c is the speed of sound.2. The audio encoding device according to claim 1, wherein the beamderiver is configured to: determine cardioid directional responsesaccording to:${{D_{12}\lbrack {k,i} \rbrack} = {\frac{1}{2}( {1 + {\cos( {{\theta_{12}\lbrack {k,i} \rbrack} - \frac{\pi}{2}} )}} )}},{{D_{13}\lbrack {k,i} \rbrack} = {\frac{1}{2}( {1 + {\cos( {{\theta_{13}\lbrack {k,i} \rbrack} - \frac{\pi}{2}} )}} )}},{and}$${{D_{13}\lbrack {k,i} \rbrack} = {\frac{1}{2}( {1 + {\cos( {{\theta_{23}\lbrack {k,i} \rbrack} - \frac{\pi}{2}} )}} )}},$and derive the A-format direct sound signals according to:A ₁₂[k,i]=D ₁₂[k,i]X ₁[k,i],A ₁₃[k,i]=D ₁₃[k,i]X ₁[k,i], andA ₂₃[k,i]=D ₂₃[k,i]X ₁[k,i], wherein: D is a cardioid directionalresponse, and A is an A-format direct sound signal of the A-formatdirect sound signals.
 3. The audio encoding device according to claim 2,wherein the encoder is configured to encode the A-format direct soundsignals to the first-order ambisonic B-format direct sound signalsaccording to: ${\begin{bmatrix}R_{W} \\R_{X} \\R_{Y}\end{bmatrix} = {\Gamma^{- 1}\begin{bmatrix}A_{12} \\A_{13} \\A_{23}\end{bmatrix}}},$ wherein: R_(w) is a first, zero-order ambisonicB-format direct sound signal, R_(x) is a first, first-order ambisonicB-format direct sound signal among the first-order ambisonic B-formatdirect sound signals, R_(y) is a second, first-order ambisonic B-formatdirect sound signal among the first-order ambisonic B-format directsound signals, and Γ⁻¹ is the transformation matrix.
 4. The audio encoding device according to claim 1, comprising a direction of arrival estimator configured to estimate a direction of arrival from the first-order ambisonic B-format direct sound signals, and a higher order ambisonic encoder configured to encode higher order ambisonic B-format direct sound signals using the first-order ambisonic B-format direct sound signals and the estimated direction of arrival, wherein higher order ambisonic B-format direct sound signals have an order higher than one.
 5. The audio encoding device according to claim 4, wherein thedirection of arrival estimator is configured to estimate the directionof arrival according to:${{\theta_{XY}\lbrack {k,i} \rbrack} = {\arctan\frac{R_{Y}\lbrack {k,i} \rbrack}{R_{X}\lbrack {k,i} \rbrack}}},$and wherein θ_(XY) [k,i] is the direction of arrival of the direct soundof frame k and frequency bin i.
 6. The audio encoding device accordingto claim 5, wherein the higher order ambisonic B-format direct soundsignals comprise second order ambisonic B-format direct sound signalslimited to two dimensions, wherein the higher order ambisonic encoder isconfigured to encode the second order ambisonic B-format direct soundsignals according to:${{R_{R}{{\underset{=}{\Delta}( {{3\sin^{2}\phi} - 1} )}/2}} = {{- 1}/2}},$${{R_{S}\underset{=}{\Delta}{\sqrt{3}/2}\cos{\theta sin}2\phi} = 0},$${{R_{T}\underset{=}{\Delta}{\sqrt{3}/2}\sin\theta\sin 2\phi} = 0},$${{R_{U}\underset{=}{\Delta}{\sqrt{3}/2}\cos 2\theta\cos^{2}\phi} = {{\sqrt{3}/2}\cos 2\theta_{XY}}},{and}$${{R_{V}\underset{=}{\Delta}{\sqrt{3}/2}\sin 2\theta\cos^{2}\phi} = {{\sqrt{3}/2}\sin 2\theta_{XY}}},$and wherein: R_(R) is a first, second-order ambisonic B-format directsound signal among the second order ambisonic B-format direct signals,R_(S) is a second, second-order ambisonic B-format direct sound signalamong the second order ambisonic B-format direct signals, R_(T) is athird, second-order ambisonic B-format direct sound signal among thesecond order ambisonic B-format direct signals, R_(U) is a fourth,second-order ambisonic B-format direct sound signal among the secondorder ambisonic B-format direct signals, R_(V) is a fifth, second-orderambisonic B-format direct sound signal among the second order ambisonicB-format direct signals,

≜ denotes “defined as”, ϕ is an elevation angle, and θ is an azimuth angle.
 7. The audio encoding device according to claim 1, comprising a microphone matcher configured to perform a matching of the N frequency domain audio signals, resulting in N matched frequency domain audio signals.
 8. The audio encoding device according to claim 7, comprising a diffuse sound estimator configured to estimate a diffuse sound power, and a de-correlation filter bank configured to perform a de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power.
 9. The audio encodingdevice according to claim 8, wherein the diffuse sound estimator isconfigured to estimate the diffuse sound power according to:A = 1 − Φ_(diff)², B = 2Φ_(diff)E{X₁X₂^(*)} − E{X₁X₁^(*)} − E{X₂X₂^(*)}, C = E{X₁X₁^(*)}E{X₂X₂^(*)} − E{X₁X₂^(*)}², and${{P_{diff}\lbrack {k,i} \rbrack} = \frac{{- B} - \sqrt{B^{2} - {4{AC}}}}{2A}},$wherein: P_(diff) is the diffuse sound power, E{ } is an expectationvalue, Φ² _(diff) is a normalized cross-correlation coefficient betweenN₁ and N₂, N₁ is diffuse sound in a first channel, and N₂ is diffusesound in a second channel.
 10. The audio encoding device according toclaim 9, wherein the de-correlation filter bank is configured to performthe de-correlation of the diffuse sound power by generating threeorthogonal diffuse sound components from the diffuse sound estimatepower:{tilde over (D)} _(W)[k,i]=DFR_(W) w _(u) U ₁ P _(2D-diff)[k,i],{tilde over (D)} _(X)[k,i]=DFR_(X) w _(u) U ₂ P _(2D-diff)[k,i], and{tilde over (D)} _(Y)[k,i]=DFR_(Y) w _(u) U ₃ P _(2D-diff)[k,i],wherein:${{DFR}_{a}\overset{\Delta}{=}{\frac{1}{4\pi}{\int\limits_{- \frac{\pi}{2}}^{\frac{\pi}{2}}{\int\limits_{- \pi}^{\pi}{{{R_{a}( {\theta,\phi} )}}^{2}\mspace{14mu}\cos\;\phi\mspace{14mu} d\;\theta\mspace{14mu} d\;\phi}}}}},{{R_{X}( {\theta,\phi} )} = {\cos\mspace{14mu}\phi\mspace{14mu}\cos\mspace{14mu}\theta}},{{R_{Y}( {\theta,\phi} )} = {\cos\mspace{14mu}\phi\mspace{14mu}\sin\mspace{14mu}\theta}},{{R_{W}( {\theta,\phi} )} = 1},{and}$${{w_{u}\lbrack n\rbrack} = {{{{\exp( {- \frac{0.5\mspace{14mu}\ln\mspace{14mu} 1e\; 6\mspace{14mu}{n}}{f_{s}{RT}_{60}}} )}\mspace{14mu}{with}}\mspace{14mu} - l_{u}} < n < l_{u}}},$wherein {tilde over (D)}_(W)[k,i] is a first channel diffuse soundcomponent, wherein {tilde over (D)}_(X)[k,i] is second channel diffusesound component, wherein {tilde over (D)}_(Y)[k,i] is third channeldiffuse sound component, DFR_(W) is a diffuse-field response of thefirst channel, DFR_(X) is a diffuse-field response of the secondchannel, DFR_(Y) is a diffuse-field response of the third channel, w_(u)is an exponential window, RT₆₀ is a reverberation time, U₁,U₂,U₃ is thede-correlation filter bank, u is a Gaussian noise sequence, l_(u) is agiven length of the Gaussian noise sequence, and P_(2D-diff) is thediffuse noise power.
 11. The audio encoding device according to claim 1, comprising an adder, which is configured to add channel-wise, the first-order ambisonic B-format direct sound signals and the higher order ambisonic B-format direct sound signals, and/or the diffuse sound signals, resulting in complete ambisonic B-format signals.
 12. The audio encoding device according to claim 1, wherein the delay estimator is configured to estimate the angle of incidence for each pair of the N audio signals based on a travelling time delay between the pair of audio signals.
 13. The audio encoding device according to claim 1, wherein the delay estimator is configured to estimate the angle of incidence for each pair of the N audio signals based on a delay in seconds and a delay in samples between the pair of audio signals.
 14. An audio recording device comprising the N microphones configured to record the N audio signals, and the audio encoding device according to claim 1.

 15. A method for encoding N audio signals, from N microphones where N≥3, the method comprising:
estimating angles of incidence of direct sound by estimating for each pair of the N audio signals an angle of incidence of the direct sound,
deriving A-format direct sound signals from the estimated angles of incidence by deriving, from each of the estimated angles of incidence, a respective one of the A-format direct sound signals, each of the A-format direct sound signals being a first-order virtual microphone signal, and
encoding the A-format direct sound signals in first-order ambisonic B-format direct sound signals by applying a transformation matrix to the A-format direct sound signals,
wherein N=3,
wherein the encoding further comprises performing a short time Fourier transformation on each of the N audio signals x₁, x₂, x₃, resulting in N short time Fourier transformed audio signals X₁[k,i], X₂[k,i], X₃[k,i],
wherein the method further comprises:
determining cross spectra of each pair of the short time Fourier transformed audio signals according to:
X₁₂[k,i]=α_(X) X₁[k,i] X₂*[k,i]+(1−α_(X)) X₁₂[k−1,i],
X₁₃[k,i]=α_(X) X₁[k,i] X₃*[k,i]+(1−α_(X)) X₁₃[k−1,i], and
X₂₃[k,i]=α_(X) X₂[k,i] X₃*[k,i]+(1−α_(X)) X₂₃[k−1,i],
determining an angle of the complex cross spectrum of each pair of the short time Fourier transformed audio signals according to:
$\tilde{\psi}_{12}[k,i] = \arctan\left( j\,\frac{X_{12}[k,i] - X_{12}^{*}[k,i]}{X_{12}[k,i] + X_{12}^{*}[k,i]} \right),$
$\tilde{\psi}_{13}[k,i] = \arctan\left( j\,\frac{X_{13}[k,i] - X_{13}^{*}[k,i]}{X_{13}[k,i] + X_{13}^{*}[k,i]} \right),$ and
$\tilde{\psi}_{23}[k,i] = \arctan\left( j\,\frac{X_{23}[k,i] - X_{23}^{*}[k,i]}{X_{23}[k,i] + X_{23}^{*}[k,i]} \right),$
performing a phase unwrapping to $\tilde{\psi}_{12}$, $\tilde{\psi}_{13}$, $\tilde{\psi}_{23}$, resulting in ψ₁₂, ψ₁₃, ψ₂₃,
estimating the delay in number of samples according to:
δ₁₂[k,i]=(N_(STFT)/2+1)/(iπ) ψ₁₂[k,i],
δ₁₃[k,i]=(N_(STFT)/2+1)/(iπ) ψ₁₃[k,i],
δ₂₃[k,i]=(N_(STFT)/2+1)/(iπ) ψ₂₃[k,i], if i≤i_(alias)
or
δ₁₂[k,i]=(N_(STFT)/2+1)/(iπ) Ψ₁₂[k,i],
δ₁₃[k,i]=(N_(STFT)/2+1)/(iπ) Ψ₁₃[k,i],
δ₂₃[k,i]=(N_(STFT)/2+1)/(iπ) Ψ₂₃[k,i], if i>i_(alias),
estimating the delay in seconds according to:
$\tau_{12}[k,i] = \frac{\delta_{12}[k,i]}{f_{s}},$
$\tau_{13}[k,i] = \frac{\delta_{13}[k,i]}{f_{s}},$ and
$\tau_{23}[k,i] = \frac{\delta_{23}[k,i]}{f_{s}},$
and estimating the angles of incidence according to:
$\theta_{12}[k,i] = \arcsin\left( \frac{c\,\tau_{12}[k,i]}{d_{mic}} \right),$
$\theta_{13}[k,i] = \arcsin\left( \frac{c\,\tau_{13}[k,i]}{d_{mic}} \right),$ and
$\theta_{23}[k,i] = \arcsin\left( \frac{c\,\tau_{23}[k,i]}{d_{mic}} \right),$
and wherein:
x₁ is a first audio signal of the N audio signals,
x₂ is a second audio signal of the N audio signals,
x₃ is a third audio signal of the N audio signals,
X₁ is a first short time Fourier transformed audio signal of the short time Fourier transformed audio signals,
X₂ is a second short time Fourier transformed audio signal of the short time Fourier transformed audio signals,
X₃ is a third short time Fourier transformed audio signal of the short time Fourier transformed audio signals,
k is a frame of the short time Fourier transformed audio signals,
i is a frequency bin of the short time Fourier transformed audio signals,
X₁₂ is a cross spectrum of a pair of X₁ and X₂,
X₁₃ is a cross spectrum of a pair of X₁ and X₃,
X₂₃ is a cross spectrum of a pair of X₂ and X₃,
α_(X) is a forgetting factor,
X* is a complex conjugate of X,
j is the imaginary unit,
ψ₁₂ is an angle of the complex cross spectrum of X₁₂,
ψ₁₃ is an angle of the complex cross spectrum of X₁₃,
ψ₂₃ is an angle of the complex cross spectrum of X₂₃,
i_(alias) is a frequency bin corresponding to an aliasing frequency,
f_(s) is a sampling frequency,
d_(mic) is a distance between the microphones, and
c is the speed of sound.
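By way of a non-limiting illustration, the estimation steps recited in claim 15 may be sketched for a single microphone pair as follows. The function name `estimate_pair_angle`, its parameters and the default values are hypothetical and do not appear in the claims. The sketch makes several assumptions: `numpy.angle` is used in place of the arctan expression, with the sign of the phase taken as the argument of the cross spectrum; the unwrapped phase is used for all frequency bins, so the separate cases for i≤i_(alias) and i>i_(alias) are not modelled; bin i=0 is skipped because the delay expression divides by i; and the argument of the arcsine is clipped to [−1, 1].

```python
import numpy as np


def estimate_pair_angle(X_a, X_b, n_stft, fs, d_mic, c=343.0, alpha=0.5):
    """Estimate the angle of incidence per frame k and bin i for one pair.

    X_a, X_b: complex STFT arrays of shape (frames, n_stft // 2 + 1).
    Returns an array of angles in radians with the same shape.
    All names here are illustrative, not taken from the claims.
    """
    n_frames, n_bins = X_a.shape
    bins = np.arange(n_bins)
    cross = np.zeros(n_bins, dtype=complex)  # cross spectrum before frame 0
    theta = np.zeros((n_frames, n_bins))

    for k in range(n_frames):
        # Recursive smoothing with forgetting factor alpha:
        # X_ab[k,i] = alpha * X_a[k,i] * conj(X_b[k,i]) + (1 - alpha) * X_ab[k-1,i]
        cross = alpha * X_a[k] * np.conj(X_b[k]) + (1.0 - alpha) * cross

        # Wrapped phase of the cross spectrum (assumed sign convention).
        psi_wrapped = np.angle(cross)

        # Phase unwrapping along the frequency axis.
        psi = np.unwrap(psi_wrapped)

        # Delay in samples: (N_STFT / 2 + 1) / (i * pi) * psi, skipping i = 0.
        delta = np.zeros(n_bins)
        delta[1:] = (n_stft / 2 + 1) / (bins[1:] * np.pi) * psi[1:]

        # Delay in seconds, then angle of incidence.
        tau = delta / fs
        theta[k] = np.arcsin(np.clip(c * tau / d_mic, -1.0, 1.0))

    return theta
```

For N=3, the same function would be called once for each of the pairs (X₁, X₂), (X₁, X₃) and (X₂, X₃), giving θ₁₂, θ₁₃ and θ₂₃ respectively.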
 16. A non-transitory computer readable storage medium comprising a computer program with a program code, which is configured to be executed by a computer to cause the computer to perform the method according to claim 15.